Coresets and streaming algorithms for the k-means problem and related clustering objectives

Author

  • Melanie Schmidt
Abstract

Continuing technological advances pose a challenge for computer science researchers, particularly in algorithms and theory. The gap between processing speed and data volume keeps widening, even though computers and their central processing units are getting faster at a rapid rate, because the data that surrounds us multiplies at an even more rapid pace. One example of this phenomenon is the Large Hadron Collider at CERN, which generates more than half a gigabyte of data every second. Here, even algorithms with linear running time are too slow if they need random access to the data. Data stream algorithms need only one pass over the data to (approximately) solve a problem. Their memory usage is usually polynomial in the logarithm of the input size. Ideally, a data stream algorithm can process the data directly while it is created.

In my thesis, I consider k-means clustering. Given n points in the d-dimensional Euclidean space R^d, the k-means problem is to compute k centers that minimize the sum of the squared distances of all points to their closest center. The centers can be chosen arbitrarily from R^d. For a given solution, i.e., a set of k centers, we call this sum of squared distances the k-means cost of the solution. The k-means problem has been studied for sixty years and often occurs in machine learning, also as a subproblem. In the context of data streams, a popular technique for solving the k-means problem is the computation of coresets. A coreset for a point set P is a (usually much smaller) point set S which has approximately the same cost as P for any possible solution. More precisely, for the k-means problem, a (1 + ε)-coreset for an ε ∈ (0, 1) is a set S such that, for any set C of k centers, the cost of S differs from the cost of P with the same centers C by at most an ε-fraction.
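
The cost and coreset definitions above can be made concrete with a small sketch. It assumes that coreset points carry weights (common in practice, but not stated in the abstract), and the function names are hypothetical. Note that `within_coreset_bound` tests the condition for one given set of centers only, whereas the definition requires it for every set of k centers.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def sq_dist(p: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point],
                weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of squared distances of all points to their closest center."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted input: every point counts once
    return sum(w * min(sq_dist(p, c) for c in centers)
               for p, w in zip(points, weights))


def within_coreset_bound(P: Sequence[Point], S: Sequence[Point],
                         S_weights: Sequence[float],
                         centers: Sequence[Point], eps: float) -> bool:
    """Check |cost(S, C) - cost(P, C)| <= eps * cost(P, C) for ONE center set C.

    A true coreset must pass this check for all sets of k centers.
    """
    cost_p = kmeans_cost(P, centers)
    cost_s = kmeans_cost(S, centers, S_weights)
    return abs(cost_s - cost_p) <= eps * cost_p


P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
C = [(0.0, 0.0), (10.0, 0.0)]
midpoints = [(0.0, 1.0), (10.0, 1.0)]  # each pair replaced by its midpoint, weight 2
print(within_coreset_bound(P, P, [1.0] * 4, C, 0.1))
print(within_coreset_bound(P, midpoints, [2.0, 2.0], C, 0.1))
```

In this toy input, P trivially passes as its own summary, while the two weighted midpoints fail for these centers: a summary that is too coarse loses part of the cost.
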
A coreset computation is often first designed as a polynomial-time algorithm with random access to the data. The algorithm is then converted into a data stream algorithm using a technique known as Merge-and-Reduce, which usually increases the memory usage by a factor polynomial in log n. In joint work with Hendrik Fichtenberger, Marc Bury (né Gillé), Chris Schwiegelshohn and Christian Sohler, I developed a data stream algorithm for the k-means problem which does not use Merge-and-Reduce. It processes the input points one by one and directly inserts them into an appropriate data structure. We use a data structure from BIRCH (Zhang, Ramakrishnan, Livny, 1997), an algorithm which is very popular in practical applications. By analyzing and improving the data structure, we developed an algorithm that computes a (1 + ε)-coreset in the data stream model using pointwise updates. Our algorithm is named BICO, a combination of BIRCH and the term coreset. The memory usage of BICO is bounded by O(k · log n · ε^{-(d+1)}) if the dimension of the input points is a constant. We implemented a slightly modified version of BICO and combined it with an algorithm for the k-means problem that is known for its good results in practical applications. In an experimental study, we verified that the combined implementation computes high-quality solutions while being much faster than other implementations that compute solutions of high quality. Our work was published at the
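
For readers unfamiliar with Merge-and-Reduce, the following is a minimal sketch of the general scheme, not of any algorithm from the thesis. The class name, the `reduce_fn` parameter and the `centroid_reduce` placeholder are all hypothetical. In particular, `centroid_reduce` is NOT a coreset construction; it only stands in for one so that the bucket mechanics can be run.

```python
from typing import Callable, List, Optional, Sequence, Tuple

WeightedPoint = Tuple[Tuple[float, ...], float]  # (coordinates, weight)
Summary = List[WeightedPoint]


class MergeAndReduce:
    """Keeps at most one summary per level, like the digits of a binary counter."""

    def __init__(self, block_size: int,
                 reduce_fn: Callable[[Summary], Summary]) -> None:
        self.block_size = block_size
        self.reduce_fn = reduce_fn  # assumed to be an offline coreset construction
        self.buffer: Summary = []
        # levels[i] summarizes block_size * 2**i input points, or is None
        self.levels: List[Optional[Summary]] = []

    def insert(self, point: Sequence[float]) -> None:
        self.buffer.append((tuple(point), 1.0))
        if len(self.buffer) == self.block_size:
            self._carry(self.reduce_fn(self.buffer))
            self.buffer = []

    def _carry(self, summary: Summary) -> None:
        level = 0
        # Merge with the occupied level, reduce, and carry upward.
        while level < len(self.levels) and self.levels[level] is not None:
            summary = self.reduce_fn(self.levels[level] + summary)
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = summary

    def summary(self) -> Summary:
        """Union of the unprocessed buffer and all stored level summaries."""
        out = list(self.buffer)
        for s in self.levels:
            if s is not None:
                out.extend(s)
        return out


def centroid_reduce(summary: Summary) -> Summary:
    """Placeholder reducer (NOT a coreset): one weighted centroid per summary."""
    total = sum(w for _, w in summary)
    dim = len(summary[0][0])
    centroid = tuple(sum(p[i] * w for p, w in summary) / total for i in range(dim))
    return [(centroid, total)]


stream = MergeAndReduce(block_size=4, reduce_fn=centroid_reduce)
for i in range(20):
    stream.insert((float(i),))
print(len(stream.summary()), sum(w for _, w in stream.summary()))
```

Only logarithmically many levels are occupied at any time, but each point passes through up to log n reduce steps, each of which adds error. Compensating for this with a smaller ε per step is where the extra polylogarithmic memory factor mentioned above comes from.
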

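The data structure that BIRCH is known for stores, for each group of points, a clustering feature: the number of points, their coordinate-wise sum, and the sum of their squared norms. The sketch below shows only this summary and why it supports pointwise updates; it is not the BICO tree, whose insertion rule and rebuilding steps the abstract does not describe. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """Summary (count, linear sum, sum of squared norms) of a point group."""
    dim: int
    count: int = 0
    linear_sum: List[float] = field(default_factory=list)
    sum_sq_norms: float = 0.0

    def __post_init__(self) -> None:
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim

    def insert(self, point: Sequence[float]) -> None:
        """Pointwise update in O(d) time; the point itself is not stored."""
        self.count += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x
        self.sum_sq_norms += sum(x * x for x in point)

    def centroid(self) -> List[float]:
        return [s / self.count for s in self.linear_sum]

    def cost_to(self, center: Sequence[float]) -> float:
        """Exact sum of squared distances of all summarized points to `center`.

        Uses sum ||p - c||^2 = sum ||p||^2 - 2 <c, sum p> + n * ||c||^2.
        """
        dot = sum(c * s for c, s in zip(center, self.linear_sum))
        norm_c = sum(c * c for c in center)
        return self.sum_sq_norms - 2.0 * dot + self.count * norm_c


cf = ClusteringFeature(dim=2)
for p in [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]:
    cf.insert(p)
print(cf.centroid(), cf.cost_to((1.0, 1.0)), cf.cost_to((0.0, 0.0)))
```

The identity in `cost_to` is exact when all summarized points are assigned to the same center. The approximation in a k-means setting comes from points of one group being closest to different centers.
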

Similar resources

Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...


On the Sensitivity of Shape Fitting Problems

In this article, we study shape fitting problems, ε-coresets, and total sensitivity. We focus on the (j, k)-projective clustering problems, including k-median/k-means, k-line clustering, j-subspace approximation, and the integer (j, k)-projective clustering problem. We derive upper bounds of total sensitivities for these problems, and obtain ε-coresets using these upper bounds. Using a dimension-...


On k-Median clustering in high dimensions

We study approximation algorithms for k-median clustering. We obtain small coresets for k-median clustering in metric spaces as well as in Euclidean spaces. Specifically, in R^d, those coresets are of size with only polynomial dependency on d. This leads to a (1 + ε)-approximation algorithm for k-median clustering in R^d, with running time O(ndk + 2^{(k/ε)^{O(1)}} d^2 n^σ), for any σ > 0. This is an improvement ...


GROUND MOTION CLUSTERING BY A HYBRID K-MEANS AND COLLIDING BODIES OPTIMIZATION

The stochastic nature of earthquakes makes it challenging for engineers to choose which records to use in their analyses. Clustering is offered as a solution to this data mining problem, automatically distinguishing between ground motion records based on similarities in the corresponding seismic attributes. The present work formulates an optimization problem to seek the best clustering measures. In...


A Clustering Based Location-allocation Problem Considering Transportation Costs and Statistical Properties (RESEARCH NOTE)

Cluster analysis is a useful technique in multivariate statistical analysis. Different types of hierarchical cluster analysis and K-means have been used for data analysis in previous studies. However, the K-means algorithm can be improved using metaheuristic algorithms. In this study, we propose a simulated-annealing-based algorithm for K-means in clustering analysis, which we refer to as...


Streaming Algorithms for k-Means Clustering with Fast Queries

We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. When compared with the current state-of-the-art, our methods provide a substantial improvement in the time to answer a query for cluster centers, while retaining the desirable properties of provably small approximation error, and low space usage. Our algorithms are based on a no...




Publication date: 2014